gnurizen (Collaborator) commented on Feb 21, 2026
- Extract native unwinder functions into native_stack_trace.h
- Combine python and native unwinder into single loop
- Add TraceInterceptor to tracehandler
- Symbolize CUDA traces before GPU timing fixup
Move defines (STACK_DELTA_INVALID, STACK_DELTA_STOP, NATIVE_FRAMES_PER_PROGRAM) and functions (push_native, bsearch_step, get_stack_delta_map, get_stack_delta, unwind_register_address, unwind_one_frame) from native_stack_trace.ebpf.c into native_stack_trace.h so they can be reused by other eBPF programs. Zero functional changes: stripping BTF metadata from the before/after blobs produces identical binaries, confirming no generated code changed.
Python programs, especially PyTorch programs, can exhaust the tail call limit by switching between the Python and native unwinders more than 29 times. This happens because of eval/delegation patterns in which one Python frame is wrapped in a couple of native frames. To unwind these stacks successfully, fold the native unwinder into the Python unwinder so that each loop iteration can unwind either a Python frame or a native frame. Replace the separate walk_python_stack inner loop and outer transition loop with a single switch-in-loop structure using the step_python and step_native helper functions. This reduces tail call usage from one per batch to one per loop budget exhaustion (PYTHON_NATIVE_LOOP_ITERS=9 iterations). Move the native unwinder map externs (exe_id_to_*_stack_deltas, stack_delta_page_to_info, unwind_info_array) out of the TESTING_COREDUMP guard in extmaps.h so python_tracer.ebpf.c can include native_stack_trace.h.
- PYTHON_NATIVE_LOOP_ITERS=9 was chosen to pass the BPF verifier on 5.4 kernels (ITERS=10 times out the verifier at >300s)
- On a failed PyCodeObject read, push a frame with the code object address so the agent can retry the read via /proc/pid/mem
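To make the tail call arithmetic concrete, here is a minimal userspace model in Go, not the eBPF code itself. It assumes, as a simplification, that the old structure spends one tail call on every switch between unwinders, and that the combined loop spends one only when its iteration budget runs out. `FrameKind`, `delegationStack`, and both counting functions are hypothetical names for this sketch; `stepPython` and `stepNative` only stand in for the real step_python and step_native helpers.

```go
package main

import "fmt"

// FrameKind tags which unwinder a frame needs (hypothetical type for this sketch).
type FrameKind int

const (
	Python FrameKind = iota
	Native
)

// loopIters mirrors PYTHON_NATIVE_LOOP_ITERS=9 from the description.
const loopIters = 9

// Stand-ins for step_python and step_native: each consumes exactly one frame
// and returns the index of the next frame to unwind.
func stepPython(i int) int { return i + 1 }
func stepNative(i int) int { return i + 1 }

// tailCallsPerTransition models the old structure under the simplifying
// assumption that every switch between unwinders costs one tail call.
func tailCallsPerTransition(stack []FrameKind) int {
	calls := 0
	for i := 1; i < len(stack); i++ {
		if stack[i] != stack[i-1] {
			calls++
		}
	}
	return calls
}

// tailCallsCombined models the single switch-in-loop: one invocation unwinds
// up to loopIters frames of either kind, and a tail call is needed only when
// the budget is exhausted with frames still left.
func tailCallsCombined(stack []FrameKind) int {
	calls := 0
	i := 0
	for i < len(stack) {
		for n := 0; n < loopIters && i < len(stack); n++ {
			switch stack[i] {
			case Python:
				i = stepPython(i)
			case Native:
				i = stepNative(i)
			}
		}
		if i < len(stack) {
			calls++ // budget exhausted, continue in a fresh invocation
		}
	}
	return calls
}

// delegationStack builds a stack in which every Python frame is followed by
// two native frames, imitating the eval/delegation pattern.
func delegationStack(pythonFrames int) []FrameKind {
	var s []FrameKind
	for p := 0; p < pythonFrames; p++ {
		s = append(s, Python, Native, Native)
	}
	return s
}

func main() {
	s := delegationStack(20)
	fmt.Println(tailCallsPerTransition(s), tailCallsCombined(s))
}
```

For 20 Python frames with two native frames each (60 frames), this model counts 39 unwinder switches but only 6 budget exhaustions, which is the kind of reduction the combined loop is meant to deliver. The real costs depend on the eBPF programs and are not measured here.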
Add a TraceInterceptor callback that is invoked after ConvertTrace on a cache miss. When the interceptor returns true, the trace is consumed (skipped for both caching and reporting), allowing callers such as the GPU subsystem to divert specific traces for further processing. Includes tests covering consume, pass-through, mixed, and non-caching behavior.
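The control flow described above can be sketched roughly as follows. This is an illustration under assumptions, not the tracehandler code: the `Trace` struct, the `handler` type, the `convertTrace` stub, and the exact `TraceInterceptor` signature are all hypothetical. The only behavior taken from the description is that the interceptor runs after conversion on a cache miss, and that a `true` return skips both caching and reporting.

```go
package main

import "fmt"

// Trace is a stand-in for a converted trace (hypothetical fields).
type Trace struct {
	Hash   uint64
	IsCUDA bool
}

// TraceInterceptor returns true to consume a trace. The signature is assumed.
type TraceInterceptor func(t *Trace) bool

// handler is a toy trace handler with a cache keyed by trace hash.
type handler struct {
	cache       map[uint64]*Trace
	interceptor TraceInterceptor
	reported    []uint64 // hashes passed to report, in order
}

func newHandler(ic TraceInterceptor) *handler {
	return &handler{cache: map[uint64]*Trace{}, interceptor: ic}
}

// convertTrace stands in for the real ConvertTrace step.
func convertTrace(hash uint64, isCUDA bool) *Trace {
	return &Trace{Hash: hash, IsCUDA: isCUDA}
}

func (h *handler) report(t *Trace) { h.reported = append(h.reported, t.Hash) }

// handle processes one incoming raw trace.
func (h *handler) handle(hash uint64, isCUDA bool) {
	if t, ok := h.cache[hash]; ok {
		h.report(t) // cache hit: the interceptor is not consulted
		return
	}
	t := convertTrace(hash, isCUDA)
	if h.interceptor != nil && h.interceptor(t) {
		return // consumed: neither cached nor reported
	}
	h.cache[hash] = t
	h.report(t)
}

func main() {
	calls := 0
	h := newHandler(func(t *Trace) bool {
		calls++
		return t.IsCUDA // divert only CUDA traces
	})
	h.handle(1, false)
	h.handle(2, true)
	h.handle(1, false)
	h.handle(2, true)
	fmt.Println(h.reported, len(h.cache), calls)
}
```

In this model the non-CUDA trace is reported twice and cached once, while the CUDA trace reaches the interceptor on both arrivals because consumed traces never enter the cache.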
CUDA stacks can sit as raw traces for a while waiting for the fixer to match them with GPU timing information. During this time, pointers in the raw traces could grow stale due to functional programs GC'ing activation records. Avoid this by symbolizing traces before parking them in the fixer maps. This has the nice side effect of removing some channel indirection: traces now go straight into the fixer maps, and once matched they go straight to ReportTraceEvent. Move CUDA symbolization earlier in the pipeline: ConvertTrace now handles CUDA frames directly, and parcagpu.Start returns a TraceInterceptor instead of a filtered channel. The interceptor diverts symbolized CUDA traces into the GPU fixer post-ConvertTrace, and completed traces (with timing and kernel name) are reported directly. This eliminates the Symbolize method on the CUDA interpreter in favor of demangling in prepTrace.
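A rough sketch of the park-then-match flow, under stated assumptions: the description does not say how traces are matched with timing, so the `CorrelationID` key is invented for illustration, as are the `Fixer`, `SymbolizedTrace`, `Timing`, and `CompletedTrace` types. The point it tries to show is that the parked trace already holds symbol names rather than raw pointers, so nothing can go stale while it waits.

```go
package main

import "fmt"

// SymbolizedTrace holds frame names, not raw addresses (hypothetical type).
type SymbolizedTrace struct {
	CorrelationID uint64 // assumed matching key, not from the PR text
	IsCUDA        bool
	Frames        []string
}

// Timing is GPU timing information that arrives later (hypothetical type).
type Timing struct {
	CorrelationID uint64
	KernelName    string
	DurationNs    int64
}

// CompletedTrace is a trace joined with its timing and kernel name.
type CompletedTrace struct {
	Frames     []string
	KernelName string
	DurationNs int64
}

// Fixer parks symbolized CUDA traces until their timing shows up.
type Fixer struct {
	pending map[uint64]*SymbolizedTrace
	report  func(CompletedTrace) // stands in for ReportTraceEvent
}

func NewFixer(report func(CompletedTrace)) *Fixer {
	return &Fixer{pending: map[uint64]*SymbolizedTrace{}, report: report}
}

// Intercept plays the interceptor role: it consumes CUDA traces by parking
// them and lets everything else pass through.
func (f *Fixer) Intercept(t *SymbolizedTrace) bool {
	if !t.IsCUDA {
		return false
	}
	f.pending[t.CorrelationID] = t
	return true
}

// OnTiming completes a parked trace and reports it directly. Timing with no
// parked trace is ignored in this sketch.
func (f *Fixer) OnTiming(tm Timing) bool {
	t, ok := f.pending[tm.CorrelationID]
	if !ok {
		return false
	}
	delete(f.pending, tm.CorrelationID)
	f.report(CompletedTrace{Frames: t.Frames, KernelName: tm.KernelName, DurationNs: tm.DurationNs})
	return true
}

func main() {
	f := NewFixer(func(c CompletedTrace) { fmt.Println(c.KernelName, c.DurationNs, c.Frames) })
	f.Intercept(&SymbolizedTrace{CorrelationID: 7, IsCUDA: true, Frames: []string{"launch", "forward"}})
	f.OnTiming(Timing{CorrelationID: 7, KernelName: "gemm", DurationNs: 1200})
}
```

Because matching and reporting happen in the same place, this shape needs no intermediate channel between the handler and the fixer, which is the indirection the description says was removed.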
Collaborator
Author
Closing, this was just an integration branch for testing...